Prime Number based Hash Algorithm and Clustering Approach for Effective Alignment of Reads from Next Generation Sequencing

Kyuho Kyung; Chihyun Park; Yunku Yeu; Sanghyun Park

Database Research Society Journal (SIGDB)

Korean Title	차세대 염기서열 결정기에서 생성된 리드의 효과적인 정렬을 위한 소수 기반의 해시 알고리즘과 클러스터링 방법
English Title	Prime Number based Hash Algorithm and Clustering Approach for Effective Alignment of Reads from Next Generation Sequencing
Author	Kyuho Kyung; Chihyun Park; Yunku Yeu; Sanghyun Park
Citation	Vol. 28, No. 2, pp. 37-53 (August 2012)
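Korean Abstract (translated)	Genome sequence alignment is the most basic and central problem in genome research: a sequence fragment is compared against a reference genome sequence of about 3 billion nucleotide characters to find the position to which it maps. In particular, with recent advances in next generation sequencing technology, research is being conducted on methods that can align the resulting large volumes of short reads to the reference genome sequence quickly and accurately. Because short-read alignment algorithms must map large amounts of data quickly and accurately, they place primary importance on speed and accuracy, but the trade-off between the two makes it very difficult to build an algorithm that satisfies both. This study proposes a new hash method that uses prime numbers to represent sequences composed of A, C, G, T, and N effectively, and presents an algorithm that aligns sequences accurately and quickly by applying a clustering method that can account for mismatches together with bit transformation. To verify the proposed method, a real human chromosome from NCBI was used as the reference sequence, and comparative experiments between BWA and the proposed method were carried out on simulated data generated from the same sequence. As a result, the proposed method was confirmed to have higher accuracy and a lower error rate than the compared algorithm.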
English Abstract	Sequence alignment, which maps DNA sequence fragments onto a reference genome sequence composed of 3 billion nucleotides, is a basic and fundamental problem in genomics. With the advent of next generation sequencing (NGS) machines and the development of the technology, methods that quickly and accurately align large numbers of short reads to a reference genome have been actively studied. Because an alignment method must map short reads to the reference quickly and accurately, both speed and accuracy are major factors. However, the two are in a trade-off relation, and it is difficult to design an algorithm that satisfies both. In this paper, we propose a novel hash method that represents genomic fragments composed of A, C, G, T, and N using prime numbers. We also propose a clustering approach that takes mismatches into account and a bit transformation approach for enhancing alignment speed. To verify the superiority of our method, we used a real genome sequence published by NCBI as the reference data and generated simulated data from it. We compared its performance with that of the BWA algorithm using the simulated data. The results showed that our method had higher accuracy and a lower error rate than the comparative method.
Keyword	Next Generation Sequencing; Genome Alignment; Hash Algorithm; Bit Transformation and Operation; Prime Number; Twin Prime; Cousin Prime; Big Data; Pattern
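
The abstract names three ingredients (a prime-number hash over A, C, G, T, N; mismatch-aware clustering; bit transformation) but does not give their definitions. The sketch below is therefore only a hypothetical illustration of the general ideas, not the authors' algorithm. The prime assignment in `BASE_PRIME`, the product-of-primes hash, the 3-bit codes in `BASE_CODE`, and all function names are assumptions made for this example.

```python
# Hypothetical illustration only: the abstract does not specify the actual
# hash, the primes used, or the bit layout. All names here are assumptions.

# Assumed assignment of one small prime per symbol.
BASE_PRIME = {"A": 2, "C": 3, "G": 5, "T": 7, "N": 11}

# Assumed 3-bit code per symbol (3 bits so that N also fits).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
BITS_PER_BASE = 3


def composition_hash(seq):
    """Multiply the primes of all bases in seq.

    By unique factorization, two sequences get the same value exactly when
    they contain the same number of each symbol (order is ignored).
    """
    value = 1
    for base in seq:
        value *= BASE_PRIME[base]
    return value


def could_be_one_substitution_apart(h1, h2):
    """Return True if the two hashes differ by swapping exactly one symbol.

    This is only a candidate filter: it compares base composition, so a
    positional check is still needed afterwards.
    """
    primes = list(BASE_PRIME.values())
    return any(h1 * q == h2 * p for p in primes for q in primes if p != q)


def pack(seq):
    """Pack a sequence into one integer, 3 bits per base."""
    value = 0
    for base in seq:
        value = (value << BITS_PER_BASE) | BASE_CODE[base]
    return value


def mismatches(a, b, length):
    """Count differing positions between two packed sequences of equal length."""
    diff = a ^ b
    # One mask bit at the lowest position of every 3-bit group.
    low_bits = sum(1 << (BITS_PER_BASE * i) for i in range(length))
    # Fold each group onto its lowest bit: set if any bit in the group differs.
    folded = (diff | (diff >> 1) | (diff >> 2)) & low_bits
    return bin(folded).count("1")


if __name__ == "__main__":
    read, ref = "ACGT", "ACGA"
    print(composition_hash(read), composition_hash(ref))
    print(could_be_one_substitution_apart(composition_hash(read),
                                          composition_hash(ref)))
    print(mismatches(pack(read), pack(ref), len(read)))
```

In this sketch the prime product acts as a cheap, order-independent filter for grouping reads with a similar base composition, and the packed XOR comparison then confirms the number of mismatching positions. Whether the paper combines the two steps in this way is not stated in the abstract.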